
fix: rna2vec case-insensitivity for secondary structures#608

Open
agam263 wants to merge 1 commit into
gc-os-ai:mainfrom
agam263:fix-rna2vec-case-sensitivity


agam263 commented May 1, 2026


Fix: Silent Data Corruption in rna2vec due to Case Sensitivity

Fixes #606

📖 Summary

This PR resolves a critical bug in the rna2vec utility function where secondary structure sequences were processed in a case-sensitive manner. This caused any lowercase input to fail triplet matching silently, returning vectors populated entirely by zeros.

🔍 Technical Root Cause

The rna2vec function transforms biological sequences into numerical representations by extracting overlapping triplets and mapping them to indices via a pre-generated vocabulary.

  • The Discrepancy: While RNA sequences were normalized through dna2rna(), secondary structure (SS) sequences were taken directly from the input.
  • The Failure: The internal triplet dictionary is generated using uppercase characters (S, H, M, I, B, X, E). When a user provided a lowercase sequence like "sshh", the triplet lookup triplets.get("ssh", 0) would fail to find a match and default to 0 (the padding index).
  • The Consequence: Instead of a meaningful numerical representation, the model would receive an all-zero vector, effectively "masking" valid data without warning the user.
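
The failure mode described above can be reproduced in isolation. The sketch below is not the pyaptamer implementation: the vocabulary construction, the name `encode`, and the index numbering are assumptions made for illustration. Only the alphabet, the `triplets.get(..., 0)` lookup, and the padding index 0 come from this description.

```python
from itertools import product

# Hypothetical stand-in for the library's vocabulary: every triplet over the
# uppercase structure alphabet, numbered from 1 so that 0 stays the padding index.
SS_ALPHABET = "SHMIBXE"
triplets = {"".join(p): i for i, p in enumerate(product(SS_ALPHABET, repeat=3), start=1)}


def encode(seq):
    """Map overlapping triplets to indices; unknown triplets fall back to 0."""
    return [triplets.get(seq[i:i + 3], 0) for i in range(len(seq) - 2)]


upper = encode("SSHH")  # both triplets are in the vocabulary
lower = encode("sshh")  # no lowercase key exists, so every lookup returns 0

print(upper)  # [2, 9] under this sketch's numbering
print(lower)  # [0, 0]
```

Because the fallback value is also the padding index, a failed lookup is indistinguishable from padding, and no exception or warning is raised.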

🛠️ Proposed Changes

1. Robust Normalization

Updated the core loop in pyaptamer/utils/_rna.py to enforce .upper() normalization on every sequence at the start of the iteration. This ensures that:

  • RNA sequences continue to be encoded correctly.
  • Secondary structure sequences are now case-insensitive.
  • Future sequence types added to the loop will inherit this safety by default.
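
A minimal sketch of the normalization step, assuming a loop that encodes each sequence in turn. The names `VOCAB` and `encode_sequences` are hypothetical, and the real loop in pyaptamer/utils/_rna.py may be structured differently. The point is only that `.upper()` is applied once, before any triplet is extracted.

```python
from itertools import product

SS_ALPHABET = "SHMIBXE"  # structure alphabet as listed in this description
# Assumed vocabulary layout: indices start at 1, 0 is reserved for padding.
VOCAB = {"".join(p): i for i, p in enumerate(product(SS_ALPHABET, repeat=3), start=1)}


def encode_sequences(sequences):
    """Encode each sequence as triplet indices after uppercasing it."""
    encoded = []
    for seq in sequences:
        seq = seq.upper()  # normalize once, at the start of the iteration
        encoded.append([VOCAB.get(seq[i:i + 3], 0) for i in range(len(seq) - 2)])
    return encoded


results = encode_sequences(["SSHH", "sshh", "sShH"])
print(results)  # all three inputs yield the same indices
```

Normalizing at the top of the loop, rather than at each lookup, means any sequence type routed through the same loop gets the same treatment.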

2. Regression Testing

Introduced a new test file: pyaptamer/utils/tests/test_rna2vec_robustness.py.

  • Test Case: Specifically validates that uppercase (SSHH) and lowercase (sshh) inputs yield identical, non-zero NumPy arrays.
  • Verification: This ensures that valid structural information is correctly encoded regardless of the input's casing.
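
The shape of such a regression test might look like the sketch below. It uses a hypothetical stand-in, `encode_structure`, in place of rna2vec (whose signature is not shown here) and plain lists in place of NumPy arrays, so it is not the test added in this PR. With NumPy arrays, `numpy.testing.assert_array_equal` would be the usual equality check.

```python
from itertools import product

# Hypothetical stand-in for rna2vec, included only so the test can run on its own.
_VOCAB = {"".join(p): i for i, p in enumerate(product("SHMIBXE", repeat=3), start=1)}


def encode_structure(seq):
    """Uppercase the sequence, then map overlapping triplets to indices."""
    seq = seq.upper()
    return [_VOCAB.get(seq[i:i + 3], 0) for i in range(len(seq) - 2)]


def test_case_insensitive_structure_encoding():
    upper = encode_structure("SSHH")
    lower = encode_structure("sshh")
    assert upper == lower  # casing must not change the encoding
    assert any(v != 0 for v in upper)  # the result must not be all padding


test_case_insensitive_structure_encoding()
```

Checking for a non-zero result matters as much as checking equality: before the fix, two lowercase inputs would also have compared equal, because both were all zeros.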

⚠️ Impact & Risks

  • Data Integrity: This fix prevents "silent failures" where models appear to be training but are actually receiving empty data.
  • Backward Compatibility: This is a non-breaking change. Output changes only for inputs that previously failed triplet lookup and returned zeros.

✅ Verification Results

  • Unit Tests: pytest pyaptamer/utils/tests/test_rna2vec_robustness.py -> Passed.
  • Existing Suite: pytest pyaptamer/utils/tests/test_rna.py -> Passed.
  • Manual Verification: Confirmed that lowercase structural strings now map to their correct triplet indices in the vocabulary.


Development

Successfully merging this pull request may close these issues.

[BUG] Silent Data Corruption in rna2vec due to Case Sensitivity